41 research outputs found

    Using Fingerprints To Position Advertising Content

    Get PDF
    Disclosed herein is a mechanism for using content-based video and audio fingerprints to find suitable locations for placing midroll advertising content. The mechanism can include assigning penalty information to different locations within a video, where the penalty information can be a combination of an audio penalty score and a video penalty score. The mechanism can use the penalty information to determine a suitable location for placing or positioning content within the video, such as midroll advertising content

    Miss Rate Prediction across All Program Inputs

    Get PDF
    Improving cache performance requires understanding cache behavior. However, measuring cache performance for one or two data input sets provides little insight into how cache behavior varies across all data input sets. This paper uses our recently published locality analysis to generate a parameterized model of program cache behavior. Given a cache size and associativity, this model predicts the miss rate for arbitrary data input set sizes. This model also identifies critical data input sizes where cache behavior exhibits marked changes. Experiments show this technique is within 2% of the hit rate for set associative caches on a set of integer and floating-point programs

    Dynamically Matching ILP Characteristics Via a Heterogeneous Clustered Microarchitecture

    Get PDF
    Applications vary in the degree of instruction level parallelism (ILP) available to be exploited by a superscalar processor. The ILP can also vary significantly within an application. On one end of the microarchitecture space are monolithic superscalar designs that exploit parallelism within an application. At another end of the spectrum are clustered architectures having many simple cores that can be clocked at a higher frequency than a comparable monolithic design. A disadvantage of the clustered design is the cost to transmit results between clusters which potentially limits the performance even in high ILP applications if instructions are not mapped carefully to minimize cross communication. In this paper, we propose an approach that incorporates the strengths of the prior two by clustering multiple integer ALUs in an asymmetric fashion. In one cluster, a 6-way out-of-order pipeline efficiently executes code having high ILP. In another cluster, a simpler, but deeper, 2-way pipeline running at twice the frequency speeds up regions of code having little ILP. We use the parallelism metrics gathered from the dynamic Data Dependence Tracking (DDT) mechanism to dynamically steer instruction windows to a cluster. We demonstrate that the heterogeneous cluster design can improve performance by up to 30% over a monolithic 8- way superscalar processor

    Dynamic Data Dependence Tracking and Its Application to Branch Prediction

    Get PDF
    To continue to improve processor performance, microarchitects seek to increase the effective instruction level parallelism (ILP) that can be exploited in applications. A fundamental limit to improving ILP is data dependences among instructions. If data dependence information is available at run-time, there are many uses to improve ILP. Prior published examples include decoupled branch exectuion architectures and critical instruction detection. In this paper, we describe an efficient hardware mechanism to dynamically track the data dependence chains of the instructions in the pipeline. This information is available on a cycle-by-cycle basis to the microengine for optimizing its perfromance. We then use this design in a new value-based branch prediction design using Available Register Value Information (ARVI). From the use of data dependence information, the ARVI branch predictor has better prediction accuracy over a comparably sized hybrid branch perdictor. With ARVI used as the second-level branch predictor, the improved prediction accuracy results in a 12.6% performance improvement on average across the SPEC95 integer benchmark suite

    AOP-Based Caching of Dynamic Web Content: Experience with J2EE Applications

    Get PDF
    Caching dynamic web content is an appealing approach to reduce Internet latency and server load. In aspect-oriented programming, caching is usually presented as an orthogonal aspect that could be automatically integrated to an application. A classical AOP motivating example is adding caching of static data with no underlying consistency. But what about caching dynamic data? In this paper, we explore the feasibility of aspectizing consistent caching of dynamically generated web documents. We use two J2EE web applications to validate our experiments: the TPC-W on-line bookstore and the RUBiS auction site. To the question "Can we consider consistent caching of dynamic web content as a separate aspect that could be transparently and efficiently integrated to a dynamic web application?", our conclusions are the following: (a) Just as in the classic AOP caching example having no consistency management, AOP provides a modular way to add caching having a strong consistency policy. (b) However, maintaining strong consistency on web pages results in prohibitively expensive run-time processing and, thus, any straightforward implementation in AOP is too slow. We propose an optimization that essentially eliminates all the run-time overhead in practice. (c) Furthermore, we identify in-stances where consistent web caching may not be orthogonal to J2EE applications, especially for those applications that rely on sophisticated web techniques (e.g., cookies). In summary, adding caching supporting strong consistency using AOP turned out to be an unexpected chal-lenge

    Improving Application Performance by Dynamically Balancing Speed and Complexity in a GALS Microprocessor

    Get PDF
    Microprocessors are traditionally designed to provide “best overall” performance across a wide range of applications and operating environments. Several groups have proposed hardware techniques that save energy by “downsizing” hardware resources that are underutilized by particular applications. We explore the converse: “upsizing” hardware resources in order to improve performance relative to an aggressively clocked baseline processor. Our proposal depends critically on the ability to change frequencies independently in separate domains of a globally asynchronous, locally synchronous (GALS) microprocessor. We use a variant of our multiple clock domain (MCD) processor, with four independently clocked domains. Each domain is streamlined with modest hardware structures for very high clock frequency. Key structures can then be upsized on demand to exploit more distant parallelism, improve branch prediction, or increase cache capacity. Although doing so requires decreasing the associated domain frequency, other domain frequencies are unaffected. Measuring across a broad suite of application benchmarks, we find that configuring just once per application increases performance by an average of 17.6% compared to the best fully synchronous design. When adapting to application phases, performance improves by over 20%

    Managing Static Leakage Energy in Microprocessor Functional Units

    Get PDF
    Static energy due to subthreshold leakage current is projected to become a major component of the total energy in high performance microprocessors. Many studies so far have examined and proposed techniques to reduce leakage in on-chip storage structures. In this study, static energy is reduced in the integer functional units by leveraging the unique qualities of dual threshold voltage domino logic. Domino logic has desirable properties that greatly reduce leakage current while providing fast propagation times. However, due to the energy cost of entering the low leakage current state (sleep mode), domino logic has thus far been used only for leakage reduction in the longterm standby mode. We examine the utility of the sleep mode (while considering the aforementioned costs) when idle times are relatively short, one to a few hundred cycles, as is often the case for functional units. Using an analytical energy model suitable for architecture-level analysis, we explore the interaction of the application and technology, and the effect on energy and performance as the underlying parameters are varied, on a set of benchmarks. Our results show that if the leakage approaches the magnitude as projected in the literature, even for short idle intervals as few as ten cycles, an aggressive policy of activating the sleep mode at every idle period performs well and a more complex control strategy may not be warranted. We also propose a simple design, called Gradual Sleep, to reduce the energy impact of using the sleep mode for smaller idle periods

    Profile-based Dynamic Voltage and Frequency Scaling for a Multiple Clock Domain Microprocessor

    Get PDF
    A Multiple Clock Domain (MCD) processor addresses the challenges of clock distribution and power dissipation by dividing a chip into several (coarse-grained) clock domains, allowing frequency and voltage to be reduced in domains that are not currently on the application’s critical path. Given a reconfiguration mechanism capable of choosing appropriate times and values for voltage/frequency scaling, an MCD processor has the potential to achieve significant energy savings with low performance degradation. Early work on MCD processors evaluated the potential for energy savings by manually inserting reconfiguration instructions into applications, or by employing an oracle driven by off-line analysis of (identical) prior program runs. Subsequent work developed a hardware-based on-line mechanism that averages 75–85% of the energy-delay improvement achieved via off-line analysis. In this paper we consider the automatic insertion of reconfiguration instructions into applications, using profiledriven binary rewriting. Profile-based reconfiguration introduces the need for “training runs” prior to production use of a given application, but avoids the hardware complexity of on-line reconfiguration. It also has the potential to yield significantly greater energy savings. Experimental results (training on small data sets and then running on larger, alternative data sets) indicate that the profile-driven approach is more stable than hardware-based reconfiguration, and yields virtually all of the energy-delay improvement achieved via off-line analysis

    Dynamically Trading Frequency for Complexity in a GALS Microprocessor

    Get PDF
    Microprocessors are traditionally designed to provide “best overall” performance across a wide range of applications and operating environments. Several groups have proposed hardware techniques that save energy by “downsizing” hardware resources that are underutilized by the current application phase. Others have proposed a different energy-saving approach: dividing the processor into domains and dynamically changing the clock frequency and voltage within each domain during phases when the full domain frequency is not required. What has not been studied to date is how to exploit the adaptive nature of these approaches to improve performance rather than to save energy. In this paper, we describe an adaptive globally asynchronous, locally synchronous (GALS) microprocessor with a fixed global voltage and four independently clocked domains. Each domain is streamlined with modest hardware structures for very high clock frequency. Key structures can then be upsized on demand to exploit more distant parallelism, improve branch prediction, or increase cache capacity. Although doing so requires decreasing the associated domain frequency, other domain frequencies are unaffected. Our approach, therefore, is to maximize the throughput of each domain by finding the proper balance between the number of clock periods, and the clock frequency, for each application phase. To achieve this objective, we use novel hardware-based control techniques that accurately and efficiently capture the performance of all possible cache and queue configurations within a single interval, without having to resort to exhaustive online exploration or expensive offline profiling. Measuring across a broad suite of application benchmarks, we find that configuring our adaptive GALS processor just once per application yields 17.6% better performance, on average, than that of the “best overall” fully synchronous design. By adapting automatically to application phases, we can increase this advantage to more than 20%

    Dynamic Frequency and Voltage Control for a Multiple Clock Domain Microarchitecture

    Get PDF
    We describe the design, analysis, and performance of an on-line algorithm to dynamically control the frequency/voltage of a Multiple Clock Domain (MCD) microarchitecture. The MCD microarchitecture allows the frequency/voltage of micrprocessor regions to be adjusted independently and dynamically, allowing enery savings when the frequency of some regions can be reduced without significantly impacting performance. Our algorithm achieves on average a 19.0% reduction in Energy Per Instriction (EPI), a 3.2% increase in Cycles Per Instruction (CPI), a 16.7% improvement in Energy–Delay Product, and a Power Savings to Performance Degradation ratio of 4.6. Traditional frequency/voltage scaling techniques which apply reductions globally to a fully synchronous processor achieve a Power Savings to Performance Degradation ratio of only 2–3. Our Energy–Delay Product improvement is 85.5% of what has been achieved using an off–line algorithm. These results were achieved using a broad range of applications from the MediaBench, Olden, and Spec2000 benchmark suites using an algorithm we show to require minimal hardware resources
    corecore